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We describe here a framework for a certain class of multiscale 
likelihood factorizations wherein, in analogy to a wavelet decompo- 
sition of an L^ function, a given likelihood function has an alter- 
native representation eis a product of conditional densities reflect- 
t-H , ing information in both the data and the parameter vector localized 

^^ ' in position and scale. The framework is developed as a set of sufH- 

cient conditions for the existence of such factorizations, formulated in 
analogy to those underlying a standard multiresolution analysis for 
Cu ' wavelets, and hence can be viewed as a multiresolution analysis for 

likelihoods. We then consider the use of these factorizations in the 
task of nonparametric, complexity penalized likelihood estimation. 
We study the risk properties of certain thresholding and partition- 
ing estimators, and demonstrate their adaptivity and near-optimality, 
in a minimax sense over a broad range of function spaces, based on 
squared Hellinger distance as a loss function. In particular, our results 
provide an illustration of how properties of classical wavelet-based es- 
vpls ■ timators can be obtained in a single, unified framework that includes 

(^ ' models for continuous, count and categorical data types. 

^^ , 1. Introduction. Wavelet-based methods have had a decided impact on 

1 -^ I the field of nonparametric function estimation in the past decade, particu- 

d • larly where concerned with inhomogeneous objects, as might be encountered 

H ! in applications such as signal and image processing. The near-optimality of 

L^ ' their risk properties (in a minimax sense) and their adaptivity to various 

• i-H . ranges of unknown degrees of smoothness, combined with simple and effi- 

K> ' cient algorithms for practical implementation, have all contributed to this 
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2 E. D. KOLACZYK AND R. D. NOWAK 

impact. See Donoho, Johnstone, Kerkyacharian and Picard (1995), for ex- 
ample, and the discussions therein. 

Much of this work rests upon a framework that assumes a standard Gaus- 
sian "signal-plus-noise model," that is, Xi = f{i/N) + Zi, where the Zi are 
i.i.d. standard normal random variables and the f{i/N) are equispaced sam- 
ples of an unknown function / on the unit interval. This model is then 
combined with an expansion 

(1) fit) = Yl '^j,ki^j,k{t), 

through which the details in / gained between certain approximations at 
scales indexed by j and j ' + 1 (formally, scales 2~^ and 2""^^'), in the 
vicinity of locations indexed by k, are captured by the coefficients cvj^k = 
{f^'4'j,k)- The 'ipj^kit) = 2^''^ip{2H — k) are orthonormal dilations-translations 
of a wavelet function ip{t) satisfying the admissibility condition / ■(/'(t) dt = 0, 
as well as various other conditions on smoothness, symmetry or such as 
desired. 

Little or no work, however, has been done extending the wavelet paradigm 
to certain other common noise models. We have in mind, in particular, 
models for count and categorical data, such as Poisson or multinomial. 
Count data of this sort arises in a variety of contexts, such as high-energy 
astrophysics or medical imaging, while a good example of such categor- 
ical data might be the images found in landcover classification from re- 
mote sensing data. The authors of this paper, in addition to various col- 
laborators, have in recent years pursued a program that seeks to extend 
wavelet-based frameworks in such directions through the use of various mul- 
tiscale probability models [e.g., Kolaczyk (1999a), Timmermann and Nowak 
(1999), Nowak and Kolaczyk (2000) and Kolaczyk and Huang (2001)]. At 
the heart of this program is the concept of a multiscale factorization of a 
given data likelihood, p(X|0), in analogy to the orthogonal wavelet decom- 
position in (1), that is, expressions like 

(2) p(X|0) oc J|p(Xj+i,2fc|Xj,fc,u;j,fc), 

3,k 

where Xj^k contains information in the original data X local to scale j and 
position fc, Xj_|_i 2fc contains information within a refined subregion of that 
and the parameters ixij^k reflect similar information in the original parame- 
ter 6. Note that the pursuit of such factorizations differs from attempting 
to simply work with the likelihood induced by applying a wavelet transform 
to the original data, as the latter approach tends quickly to lead to model 
expressions that suffer from difficulty of interpretation and computational 
intractability outside of the Gaussian case [Kolaczyk (1999b)]. 
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Our goal in this paper is to show that a systematic approach can be 
taken to the topic of multiscale probabihty models, in which for a certain 
class of such models the relevant characteristics of traditional wavelet-based 
models and their extensions are paralleled quite closely. In particular, our 
contribution consists of two related components. First, we show that factor- 
izations like that in (2) arise when conditions for a "multiresolution analysis 
(MRA)" of likelihoods are satisfied, where the conditions are a blending 
of concepts from the fields of wavelets, recursive partitioning and graphi- 
cal models. These conditions are then shown to characterize the Gaussian, 
Poisson and multinomial models. Hence, such multiscale probability models 
provide an example of a unified framework for modeling data of contin- 
uous, count and categorical types in a fashion sensitive to location-scale 
variation. Second, we quantify the risk behavior of certain nonparametric, 
complexity penalized likelihood estimators based on our factorizations. We 
show that a near-optimality and adaptivity completely analogous to that 
of wavelet-based estimators holds for these disparate data types, for appro- 
priately defined smoothness classes, using the squared Hellinger distance 
as a loss function. The technical details behind these results rely on upper 
bounds on the risk in the spirit of recent work by Birge and Massart [e.g., 
Barron, Birge and Massart (1999) and references therein], but these bounds 
are derived by adapting a technique of Li (1999) and Li and Barron (2000). 
In addition to the above two primary contributions, we also comment briefly 
on the algorithmic efficiency with which our various estimators may be cal- 
culated, an indication of their relevance to practice as well as theory. 

The body of the paper is arranged as follows. In Section 2 we present 
our multiresolution analysis for likelihoods. Then in Section 3 we provide 
necessary details for a certain class of models for continuous, count and cat- 
egorical data types, and introduce three estimators of the relevant underlying 
function. Risk properties of these estimators are then stated in Section 4. 
Proofs of these results are detailed in Section 5, and some final comments 
and discussion are compiled in Section 6. Finally, a result on the algorithmic 
complexity of our estimators is proven in the Appendix. 

2. A multiresolution analysis for likelihoods. Consider a stochastic pro- 
cess X{t) on the interval [0, 1) that, either by choice or perhaps by the 
limitations of measuring instruments, is observed only discretely on the in- 
tervals li = [i/N, {i + 1)/N), i = 0, . . . ,N — 1. Furthermore, suppose that 
corresponding to this process is a function 6{t),t£ [0,1). We will assume 
that the effect of the discretization is to yield a vector of measurements X = 
{Xq, . . . , Xn~i), associated with a vector of parameters = {6q, . . . , 9n~i), 
where each pair {Xi,9i) corresponds to the interval /j and is obtained by 
sampling the function 9{-) and then sampling Xi, in a manner to be made 
precise later. We will denote the likelihood of X, given the parameter value 6, 
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by p(X.\6) generically for both discrete and continuous distributions on X 
[i.e., p(-) is to be understood to be defined with respect to an appropriate 
measure i']. At times, when convenient, we may abbreviate this notation 
as pg. 

Informally speaking, a simple yet standard multiscale analysis of the 
data X is achieved by defining the dyadic intervals Ij^k = [^/^-^j (^ + l)/2-^)) 
for j = 0, . . . , J - 1, k = 0,...,2^ -1, and J = logafiV) (i.e., with N as- 
sumed to be a power of 2, for convenience), and associating with each a 
summary statistic Xjj^ = J2i/N£i-k-^i- That is, we define an analysis sep- 
arating the information in X into its components at various combinations 
of scale and position {j,k). This strategy, of course, underlies the analy- 
sis of X with respect to an orthonormal basis of dyadic Haar wavelets, 
specifically, analysis through the discrete inner products of X with func- 

1/2 

tions hj^kii) = (Xj+i,2fc+i(«) - Xj+i,2k{i))/Nj'j^ , defined on the index set 
{0,1,...,A^ — 1}, where Xj,k is the characteristic function for the discrete 
analogue of the interval Jj^, that is, {i:/i C /j^fc}, and Nj^i^. is the cardinal- 
ity of this set. 

This particular notion of multiscale analysis can be generalized by gen- 
eralizing the underlying notion of partitioning. Specifically, beginning with 
the unit interval [0,1), we partition that interval in a recursive fashion, 
where split points are constrained to the endpoints of the original sampling 
intervals /j, until a complete recursive partition (C-RP) V* = {/i}j=o is 
achieved. That is, beginning with the trivial partition [0,1), we split that 
into two pieces at one of the points {i/N}^^ . Then, proceeding in a recur- 
sive fashion, given a partition V intermediate to [0, 1) and V* , we refine that 
partition by splitting one and only one of the intervals I £V at one of the re- 
maining allowable points (i.e., those points in the intersection of {i/N}-J[ 
and the interior of /). We often will call the interval / in such cases the 
"parent" interval, and the two corresponding subintervals, say Ich{i),i ^^'^ 
Ich{i),vi the left and right interval "children." Partitions V produced further 
along in the recursive process than a partition V will be said to be refine- 
ments of "P, which we will denote V <V' (refinement that includes potential 
equivalence will be denoted using "^"). Finally, for a given V ^V* , let T{V) 
be the collection of all intervals I found in at least one partition V' ^V, 
and let Xnt('P) be all such nonterminal intervals [i.e., all intervals / G liV) 
that are not in V itself]. 

The multiscale analysis of X corresponding to V* is then composed of the 
statistics Xj = J2i/N&i -^i^ ^^^ ^11 intervals /SX(P*). This analysis can be 
linked in turn to analysis with respect to an orthonormal basis of so-called 
unbalanced Haar wavelets [Girardi and Sweldens (1997)] 

'Xch{l),r{i) Xch{I)^{^) 



(3) hj{i)=c'j 



^ch(/),r ^ch(/),l 
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where Nj = ^{i : /j C /} is the discrete length of an interval /, and c'j = 
^^ch(i) r ~'~ "^chf/l i)~^ ^^ ^ normalizing constant. Note that the dyadic analy- 
sis above is seen to be the special case in which parent intervals are split only 
into two interval "children" of equal size, that is, Nch{i),\ = -^ch(/),r = -^//2, 
yielding the complete recursive dyadic partition (C-RDP) V^ and the 
dyadic Haar wavelets hj^k- 

Our goal in this section is to show how the above concepts may be used to 
produce a probabilistic analogue of an orthonormal wavelet expansion like 
that in (1). We do so by introducing a formal analogue of the key conceptual 
framework underlying the latter, that is, multiresolution analysis. 

2.1. Development of a formal m,ultiresolution analysis. 

2.1.1. Function space multiresolution analysis. Fundamental to the con- 
cept of wavelets is the notion of a multiresolution analysis (MRA). Briefly, 
the idea behind this method is to construct a sequence of subspaces Vj C 
L^(]R), across scales j, whose members contain successively finer approxi- 
mations to functions / G L^(R). The classical multiresolution analysis [e.g., 
see Daubechies (1992)] requires the following three sets of characteristics of 
these subspaces. 

(A) Hierarchy of nested subspaces. The subspaces Vj satisfy the condi- 
tion 

• • • y_2 C y_i C Fo C Fi C ^2 • • • , 
where rijezVj = {0} and U^^^ = L^ (M) . 

(B) Orthonormal basis within Vq. There exists a function <p such that 
the collection {(/>(• — k)}k£Z forms an orthonormal basis for Vq. 

(C) Scaling between and translation within subspaces. 

g(zVo =^ g(^.-k)£Vo VfcGZ. 

Our interest in the above characteristics (gathered into these three labeled 
categories for our own later convenience of exposition) centers primarily on 
the fact that they form a set of sufficient conditions for the existence of a 
(wavelet) function ^p E L^(M) for which the collection {ipj^k} forms an or- 
thonormal basis of L^(M), as in (1). In other words, these conditions assure 
a multiscale decomposition or "decoupling" of any given function / G L^ (M) 
into components of L^ "energy" localized to certain combinations of scale j 
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and position k. The fact that this decouphng is with respect to an orthonor- 
mal basis imphes that knowledge of these components (i.e., the coefficients 
and their corresponding wavelets) is equivalent to knowledge of the func- 
tion / itself — only the representation has changed. 

2.1.2. Likelihood multiresolution analysis. In analogy to the three con- 
ditions (A)-(C) outlined in Section 2.1.1, we provide four conditions (A*)- 
(D*) sufficient to insure a certain multiscale likelihood factorization. The 
first three conditions will be seen to play roles that parallel those of (A)- 
(C). However, to obtain a factorization fully analogous to an orthonormal 
basis decomposition, the fourth condition (D*) is needed. Our conditions are 
as follows. 

(A*) Hierarchy of recursive partitions. A hierarchy of recursively defined 
partitions 

beginning with [0, 1) and ending with a C-RP V* = {Ii\fS(^. 

(B*) Independence within V* . The components of X are statistically 
independent, the components of are L-independent with respect to the 
likelihood of X, that is, 

p(X|0) = n P{X^\e^), 
i=0 

and the p.d.f. for each Xi is a member of some common parametric family 

j^ = {p{-\e):eGecR}. 

(C*) Reproducibility between partitions. The family J^ is reproducible 
in e, in the sense that, for ah / G I(V*) and V0 G G^^ the p.d.f. of Xj = 

'Ei/Nei^i is p{Xi\6i) G J=', where Bj = Y^i/Nai^i- 

(D*) "Decoupling" of parameters with partitions (i.e., cuts). For any 
Xj~p(-|6'i) ^J^,i& {ii,i2}, there exists some reparameterization {9i^,9i.^) — > 
(6*, a;) such that 

p{Xi,,x,,\ei,A,)=p{x\e)p{x,,\x,oj), 

where X = Xi^ + Xi^ and 6 = 6i^ + Oi^. That is, the sum X is a cut [e.g., 
Barndorff-Nielsen (1978)] for {Xi^,Xi^). 

Some remarks on these conditions are useful prior to stating our main 
result on likelihood factorizations. First, note that through (A*) the notion of 
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multiple resolutions takes the form of recursive partitioning (or, conversely, 
hierarchical aggregation) of our data space. Second, in condition (B*), the 
assumption of a likelihood factorization in the original data space, with 
respect to the index set {0, . . . , A^ — 1}, and in components identical up to 
the parameters 0i, mirrors the spirit and function of the orthonormal basis 
{(/)(• — k)} in Vq. The condition of L-independence requires that the domain 
of variation of 6 be equal to the product of the domains of the components 6i 
and that the role of in the likelihood p(X.\0) can be separated in one-to- 
one correspondence with the statistically independent likelihood components 
of the Xi [Barndorff-Nielsen (1978)]. Third, for each interval / G I{V*), 
it is desirable to combine the information in {Xi^ , • ■ • , Xi^^ } into a single 
summary statistic, in analogy to orthogonal projection of a function onto 
subspaces Vj. We do so using the simplest approach, direct summation, that 
is, X[ = J2i/Nei -^i- Condition (C*) dictates that the distributional family JP" 
is in some sense "invariant" under this summation — a scale-invariance, in a 
sense. Practically speaking, there are in fact a number of similar definitions 
available. We use the very simplest definition here, found, for example, in 
Wilks (1962), which describes the well-known behavior of such distributions 
as the Gaussian, Poisson, Cauchy and others. 

Our perspective in introducing these conditions is one in which we view 
an orthonormal basis decomposition essentially as a "decoupling" of infor- 
mation over some meaningful index space. In the case of wavelets, the index- 
ing (j, k) refers to information local in scale and position. For likelihoods, a 
fully analogous decoupling requires both independence and -L-independence 
in this indexing, such as we assume holds true in the original indexing i 
through condition (B*). Conditions (A)-(C) are sufficient to guarantee a 
multiscale decoupling of a function / in the manner of (1). However, condi- 
tions (A*)-(C*) yield only statistical independence in the multiscale index- 
ing, and not L-independence. Put another way, we have a factorization of 
p(X.\0) into components that are functions of only local information in X, 
but possibly global information in 6. Condition (D*) remedies this situation. 

We now state the main result of this section. 

Theorem 1. Assume that the conditions (A*)-(D*) hold. Then there 
exists a factorization of the form 

(4) p{X\e)=p{XiJ9iJ n P(^ch(/),il^/,^/), 

/SXntCP*) 
with respect to some reparameterization {6J^^g,u} of 6, for /qo = [0,1) and 

a — v^A^— 1 a 
c'/oo — l^i=0 ^i ■ 

Proof of the theorem follows immediately, in light of the conditions and 
the above discussion. Alternatively, this result may be viewed as a conse- 
quence of the fact that conditions (A*) and (B*) imply a so-called directed, 
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local Markov property for the graphical model given by {X/}/£j(--p.\, where 
the underlying graph is just a binary tree T* = T{V*) equipped with arrows 
denoting parent-child relationships. Equation (4) then follows as an exam- 
ple of a recursive factorization [e.g., Lauritzen (1996), Theorem 3.27]. The 
fact that the conditional distributions in the factorization are of the same 
family follows from condition (C*), and the reparameterization follows from 
condition (D*). 

For a random variable X associated with family J^ for which (A*)-(D*) 
are satisfied, we will say that J- allows a likelihood MRA with respect to 9. In 
considering the factorization in (4) , note that the role of a wavelet coefficient- 
function pair (wj^fc, ipj^k)i in capturing detail lost between scales j ' + 1 and j in 
approximating / G L^(M), is played here by the conditional density p{Xc^j-^i\Xi , mi) 
a natural form of expressing the information lost between the aggregations 
dictated by a partition V ~<V* and its immediate predecessor. 

2.2. Characterization. The conditions of Theorem 1 may be used to char- 
acterize certain families J- that allow a multiresolution analysis. We illustrate 
with the canonical case in which J^ is a one-parameter natural exponential 
family (NEF), that is, 

p{Xi\T]i) = a{i]i)h{Xi) expJT^jXj} 

with respect to some sigma-finite measure z^(-), for natural parameter ry £ 
E C M. Specifically, we have the following result. 

Theorem 2. Suppose that T is a (minimal and steep) one-parameter 
NEF. Then it follows that: 

(i) T allows a likelihood MRA with respect to the natural parameteriza- 
tion 6 = r] if and only if T is the family of Gaussian distributions; 

(ii) JT allows a likelihood MRA with respect to the mean parameterization 

9 = fi{ri) if and only if T is either the family of Gaussian distributions or 
the family of Poisson distributions. 

Proof of Theorem 2 follows from the use of results in the literature 
on reproducibility and cuts. One begins by noting that the collection of 
NEFs J- satisfying (C*) must be contained within the collection of such T 
which do so in the case of i.i.d. random variables, that is, where % = ••• = 
r/jv-i = ?7, for some ry G E. A characterization of this latter case, under a 
slight generalization of our own definition of reproducibility, is provided in 
Bar-Lev and Enis (1986). Specifically, among various other results, these au- 
thors show that reproducibility implies that J- must have a power variance 
function (PVF) and that there are only four NEF-PVF families. Exami- 
nation of the cumulant generating function for these four families yields 
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the candidate distributions under cases (i) and (ii) of the theorem. The re- 
sult follows by confirming that the sum of independent random variables 
forms a cut for the joint distribution in the case of the Gaussian and Pois- 
son families, hence satisfying condition (D*), which is straightforward [e.g., 
Barndorff-Nielsen (1978)]. 

Theorem 2 thus establishes formally a role for the Gaussian and Pois- 
son distributions in our class of multiscale probability models. These mod- 
els have been derived from first principles in previous work [e.g., Kolaczyk 
(1999a), Timmermann and Nowak (1999) and Nowak (1999)]. Similarly, a 
moment's thought reveals that the factorization in (4) holds as well for the 
case in which X follows a multinomial distribution, given the appearance 
of that distribution when conditioning a vector of independent Poisson ran- 
dom variables on their total (i.e., -^/qo)- ^^ fact, it can be shown using results 
from the literature on cuts for discrete NEFs [e.g., Barndorff-Nielsen (1978) 
or, alternatively, the work of Joshi and Patil (1970)], that (4) holds for all 
members of the class of sum-symmetric power series distributions (SSPSD), 
that is, where X has probability mass function of the form 

(5) p(X|6>)=6(X) ° ■■■ ^-1 



9(0) ' 

where the generating function g{-) depends on 6 only through 9q-\ h ^Af- 1 • 

The Poisson and multinomial families are two members of this class, with 
the former being the unique member for which the components of X are un- 
correlated. Hence, this result also indicates that the statistical independence 
assumed in condition (B*), while sufficient, is not necessary for the result of 
Theorem 1. 

To what degree these results may be extended remains an open question. 
The above discussion suggests that extensions are unlikely without signifi- 
cant relaxation of conditions (A*)-(D*). While such extensions potentially 
could be interesting, for example, in the event that they may necessarily par- 
allel certain aspects of the "second generation" [Sweldens (1998)] extensions 
of the classical MRA underlying conditions (A)-(C), it is not clear whether 
they would lead to methods of practical interest. 

3. Multiscale penalized maximum likelihood estimation. We now turn 
our attention to the problem of estimating the unknown parameter vector 6 
from data X, when the underlying distributional family allows a likelihood 
MRA. In light of the results of the previous section, for the remainder of 
this paper we will restrict our attention to three models, those of the Gaus- 
sian, Poisson and multinomial families, as canonical examples of models for 
continuous, count and categorical measurements. Additionally, in prepara- 
tion for the results of our risk analysis in Section 4, and the corresponding 
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proofs in Section 5, we will again make use of the function 6{-) underlying 6 
[although the role of X(-) will remain implicit through X]. 

Let be a generic function space to be defined later, such as the space of 
functions of bounded variation or a Besov space. We define our three models 
as follows. 

(G) Gaussian model. Let 9 £@, and define 9i = N Jj_ 9{t) dt to be the av- 
erage of 9 over Ij. Sample the Xi independently as Xi\9i ~ Gaussian(0i,<T^), 
where cr^ is assumed fixed and known. 

The multiscale components in (4) then take the form 

^ch{i),i\'^i,'^i ~ Gaussian I — J^-^^i " '^i-.cia I , 
with 

' '^ch(/),r fch(/),l 



-^ch(/),r ^ch(/),l 

for c/ = -/Vch(7),i-/Vch(/),r/^/- The coarse scale component takes the form X/gp| 
6'/oo ~ Gaussian (6'/oo,A'/oo 0-2). 

(P) Poisson model. Let 61 G 9, where 9{t) e [c,C], Vi G [0,1], for < 
c<C. Define 9i = N Jj, 9{t)dt to be the average of 9 over 7j. Sample the 
Xi independently as Xi ~ Poisson(6'j). 

The multiscale components in (4) then take the form 

-'^ch{7),il^/,'^/ ~ Binomial(X/;a;/), 

with uJi = 9ch(i),\/9i, while the coarse scale component takes the form X/pp |^/p(^ 
Poisson(0/pQ). 

(M) Multinomial model. Let 9^0, where 9{t) G [c, C] Vt G [0, 1], for < 
c<C, and /q 9{t) dt = 1. Define 9i = Jj, 9{t) dt. Let the components Xi arise 
through (singular) multinomial sampling, that is, X ~ Multinomial(n;0), 
for some n^ N . 

The multiscale components in (4) then take the form 

-'^ch{7),il^/,'^/ ~ Binomial(X/;a;/), 

with uJi = 9(.h{i),i/9i, as in the Poisson model, but the coarse scale component 
is now a point mass at n. 

Model (G) is just the Gaussian "signal-plus-noise" model with average- 
sampling of the underlying function 9{-), while model (P) is the Poisson ana- 
logue. Model (M) can be viewed as a discretized density estimation model, 
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where a sample of size n from the density 9 is imphcit. The criterion that N 
be such that n ~ A^ in this model may be satisfied, for example, by se- 
lecting A^ to be the smallest power of 2 greater than or equal to n. This 
choice aids in producing computationally efficient implementations of the 
estimators defined in the next section, and has been found to work well in 
practice. The fact that the Poisson and multinomial models share the same 
multiscale components follows from the shared properties of the SSPSD class 
of distributional families, as explained above. 

3.1. Complexity penalized estimators. The construction underlying the 
multiscale factorization in (4) involves intimate connections between fac- 
torizations, partitions and orthonormal bases, the exploitation of which is 
important for the creation of adaptive estimators and efficient algorithms for 
their calculation. That the factorization is closely linked to recursive parti- 
tioning is clear (i.e., through V*). However, through the latter, the former 
is also linked to a certain class of wavelet bases. To see this it is enough to 
note that, in the construction of any given C-RP V* , the splitting of each 
parent interval I into its two children can be associated with a function hi, 
as defined in (3). This function is a generalization of the dyadic Haar func- 
tion /ijfc defined earlier and the collection of all such hj, for / gTnt(^*)) 
along with a single scale function on the full interval [0,1), are an exam- 
ple of what Girardi and Sweldens (1997) term an "unbalanced" Haar basis 
(UHB). 

The multiscale coefficients ujj in model (G) actually are proportional to 
the UHB coefficients {9, hi), differing only by their constants c/ and c^. 
Therefore, in particular, wj = if and only if {6, hi) = 0. On the other hand, 
in models (P) and (M) the w/ arise as ratios of (left) child to parent sums 
of the appropriate components of 6. However, these ratios can be expressed 
as simple functions of the corresponding Haar coefficients, that is, 

(6) uiJ-^ = di( ^^ ^^'^^^ 



Gl V^ch(/),r ^/ 

Note that uji = Nch(i),i/^i = Pi if and only if {6, hi) = 0. The value of pi is 
the ratio of the (discrete) lengths of the interval / and its (left) child, and 
indicates homogeneity (smoothness) in 6 at the scale and position of /, just 
as LOi = does in the Gaussian-wavelet case. 

Hence the well-known exploitation of "sparseness" associated with wavelet 
expansions in Gaussian denoising problems also carries over to the Pois- 
son and multinomial models, in that a piecewise-homogeneous vector 6 
will have a large proportion of its uji equal to the corresponding pi. This 
point suggests the promise of extensions of "keep-or-kill" thresholding al- 
gorithms to models (P) and (M), in which individual multiscale parame- 
ters uji are either set to their empirical value Xchj-/) i/X/ (i.e., "keep") or 
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to the default value pi (i.e., "kill"), as determined by whether a certain 
criterion function exceeds a threshold or not. In addition, in analogy to 
the results of Donoho (1997), in which the equivalence of a certain form of 
CART [Breiman, Friedman, Olshen and Stone (1984)] and algorithms based 
on constrained thresholding of appropriately defined Haar expansions are de- 
tailed, one might instead choose to base an estimator upon some optimal 
choice of recursive partition V among all such partitions V ^V* , for some 
fixed choice of C-RP V* . For example, the complete recursive dyadic parti- 
tion V^ is a natural choice. Lastly, if one were to consider searching only 
with respect to a given C-RP V* to be too constraining, a natural exten- 
sion is to the entire library C of all (A'^ — 1)! possible C-RP's V* of the 
interval [0, 1). 

We therefore consider here estimators of thresholding, recursive dyadic 
partitioning and (general) recursive partitioning types, generically for each 
of our three models (G), (P) and (M). In the remaining sections we present 
results regarding the risk properties of certain simple versions of these types 
of estimators. Formally, we express the estimators in the form 

(7) eCX) = argmin{-logp(X|0') +2pen(0')}, 

e'er 

(8) pen(6>') = A • #{u;/(0') nontrivial}, 

where "nontrivial" means nonzero in the Gaussian case and not equal to pi 
in the Poisson and multinomial cases, A is a penalty factor to be defined later 
(e.g.. Theorem 4) and F is the space of possible values for a given estimator, 
defined as follows: 



1. thresholding (T), 

(9) TT = le' 

2. recursive dyadic partitioning (RDP), 
(10) 



9'i = Po + Y^ Pihj{i) for I C I^TiV^y) } ; 
lei ) 



rnBP^le'\e'i = Po+ ^ I3ihi{i) iovV^V^A; 



I£Int{P) 
3. recursive partitioning (RP), 



(11) 






/GXnt(P) 



In the definition of our spaces F in (9)-(ll), which we have expressed in 
terms of the UHB functions (3) for simplicity and comparison [and recall 
I-^T^iV) is the set of all nonterminal intervals encountered in the construction 



MULTISCALE LIKELIHOOD ANALYSIS 13 

of a recursive partition P], it is to be understood that the coefficient vec- 
tors (3 are constrained accordingly in each of the models (G), (P) and (M). 
That is, while there are no constraints in model (G), positivity of 0' is re- 
quired in model (P), and both positivity and unit summability in model (M). 
These latter two sets of constraints are enforced naturally in actual com- 
putations by virtue of working in the reparameterization {9'j ,uj'j)- Hence 
the multiscale reparameterizations are important not only algorithmically, 
through their role in the decoupling of terms in (4), but also mathematically 
in enforcing original constraints on 0' in a natural manner. 

Concerning the optimization in (7), by comparing the factorized form of 
the likelihood in (4) with the summability of the penalty in the same multi- 
scale indexing in (8), it is not difficult to see that the estimator G^ is equiva- 
lent to performing a set of independent generalized likelihood ratio tests for 
each / G TNT('PDy)- (Note that our choice of V* = V^ here is arbitrary and 
made for convenience.) For model (G), since the log-likelihood is simply a 
sum of squares, this is in fact penalized least-squares with a counting penalty, 
which reduces to so-called hard-thresholding. On the other hand, for models 
(P) and (M) the index set 2 for the optimal estimate 0t corresponds to 
those indices I for which the null hypothesis Hj :loj = pi is rejected in fa- 
vor of Hj :ujj ^ pj, with respect to the local binomial likelihood functions. 
The estimator 0rdp is the analogue of the penalized least-squares estimator 
defined in Donoho (1997), wherein it is noted that recursive dyadic partition- 
ing estimators like this are in fact thresholding estimators with hereditary 
constraints placed on which components may be "kept" or "killed" (i.e., re- 
sulting from the requirement that the partitioning be recursive). Finally, the 
estimator 0rp is an extension of this framework and reasoning to the larger 
space C. 

On a final note, we mention that all three of the estimators defined above 
may be computed in a computationally efficient manner, as summarized in 
the following. 

Theorem 3. For models (G), (P) and (M), the following hold: 

(i) 0x niay be computed using an 0{N) thresholding procedure; 

(ii) 0RDP ''nay be computed using an 0{N) optimal tree pruning algo- 
rithm; 

(iii) 0RP may be computed using an 0{N'^) dynamic programming algo- 
rithm. 

Proof of the theorem may be found in the Appendix. 



14 E. D. KOLACZYK AND R. D. NOWAK 

4. Risk analysis. We state briefly in this section the main results deriv- 
ing from a risk analysis for the estimators 9t, 0rdp and 0rp under the 
models (G), (P) and (M). In this section and the remainder of the paper, we 
will use X to denote an arbitrary element from the range of the random vari- 
able X. We will measure the loss associated with estimating by in terms 
of the (squared) Hellinger distance between the corresponding densities, that 
is, 

(12) „ 2 

= / [Vp(x|0) - Vp(x|0)J z^(x), 

where ly is the dominating measure. The risk will be defined then as R{6, 6) = 
{1/N)E[L{6,0)], where the expectation is with respect to 0. 

We assign properties to the values of the true through properties of the 
function 6{-) from which it was sampled. Given the role of Haar-like functions 
in our framework, a natural space Q to consider is a ball O = BV(C) of 
functions of bounded variation, that is, for which 

M 

(13) sup sup y^\eitm)-9itm-l)\<C. 



M>2U<-<t 



We then have the following result regarding upper bounds on the risk. 

Theorem 4. Assume Q = BV(C) and the conditions of either model (G), 
(P) or (M). Let the constant X in (8) he of the form '^logN , for 7 > 3/2 
and N >3. Then the risks of the estimators 9^ and 0rp are bounded above 
by 0{{logN/N)'^''^), while the risk of the estimator 0rdp is bounded above 
byO{{log^N/N)^/^). 

Proof of this result and all others given in this section can be found 
in Section 5. Note that the performances of the estimators are bounded 
identically for the three models. A minor variation in the overall proof leads 
to risk statements similar to those of Theorem 4 when loss functions of 
squared-error type are used. 

Corollary 1. The same upper bounds hold when risk is measured for 
the models (G), (P) and (M) as {1/4:N)E[\\6 -d\\^], {l/N)E[\\e^/^ -d^/^W^] 
and {l/N)E[\\{ney/'^ - {nOy/'^f], respectively. 

As mentioned in the Introduction, two key properties of wavelet-based 
thresholding estimators are their near-optimal risk behavior and their adap- 
tivity. To establish near-optimality of our multiscale estimators we need the 
following lower bound on the minimax risk. 
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Theorem 5. Assume the conditions of Theorem 4 and Hellinger loss. 
Then the minimax risk obeys a lower bound of 0{N~'^'^) in each of the 
models (G), (P) and (M). 

Hence, in particular, combining the upper and lower bounds in Theorems 
4 and 5, we obtain that the estimators Ot and 0rp come within the same 
logarithmic factor of minimax risk, while the estimator 0rdp comes within 
the square of that factor, for each of the models (G), (P) and (M). 

Adaptivity of our estimators then follows from the fact that similar near- 
optimality statements hold in other spaces of varying smoothness, despite 
the fact that the estimators have no a priori knowledge of which space(s) con- 
tains 0{-). For example, for an appropriately defined range of Besov spaces 
B^ we have the following. 

Theorem 6. Suppose = i?| „([0, 1]) is a Besov space, with < ^ < 1 
and 1 < p < oo such that 1/p < ^ + 1/2, and q > 0. Then the conclusions 
of Theorems 4 and 5, as well as Corollary 1, hold with the exponent 2/3 
replaced by 2^/(2^ + 1). 

In summary, the results of this section describe, for smoothness classes 
appropriate to Haar-like bases, how all three of our multiscale complexity 
penalized likelihood estimators exhibit the same sort of adaptivity and near- 
optimality properties as classical wavelet-based models — but simultaneously 
for certain continuous, count and categorical data types. As the proofs in 
the next section demonstrate, this success is due primarily to (i) the role of 
Haar-like multiresolution structures in our framework and (ii) the ability to 
decouple the information in the data through likelihood factorizations that 
mirror these structures. Extensions of these results to classes of smoother ob- 
jects (e.g., Besov spaces with ^^ > 1) would seem to require wavelets smoother 
than those in a Haar basis. However, achieving similar likelihood factoriza- 
tions in this context is a much more difficult problem, and one for which 
it seems unrealistic to expect the type of complete multiscale decoupling of 
information in data and parameters inherent in (4). 

5. Proof of main results. We establish proofs of the results in Section 4 
here. Although the case of model (G) could be handled using arguments 
from the existing literature on wavelets and Gaussian noise models, such 
arguments do not immediately extend to the cases of models (P) and (M). 
Hence, since our interest is in part to illustrate how all three models may 
be handled in a unified fashion, we introduce a generally applicable result 
in Section 5.1. The results for model (G), within this general framework, 
are then presented in Section 5.2, while the results for models (P) and (M) 
follow in Section 5.3. 
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5.1. A fundamental risk bound. Our proof of Theorem 4, and hence the 
near-optiniahty of our estimators, rests primarily upon a fundamental up- 
per bound on the expected (squared) Hellinger distance. The form of this 
bound is much like those in Barron, Birge and Massart (1999). However, 
whereas the proof of their (quite general) bounds relies on recent advances 
in isoperimetric inequalities and a great deal of careful technical work, our 
own bounds adapt recent arguments of Li (1999) and Li and Barron (2000) 
for mixture-based density estimation which, in particular, rely on the dis- 
cretization (quantization) of the space T of estimates [in the spirit of earlier 
work by Barron and Cover (1991)] to produce a relatively simple proof, ap- 
plicable to all three models under consideration here. 

Let H'^{pg{i),Pg(2)) be the (squared) Hellinger distance between two den- 
sities p{'x.\6^ ') and p(x|0' '), as introduced earlier. Additionally, define the 
Kullback-Leibler divergence between these densities as 

(14) i^(p,a),P,(.)) = /log^|^^p(x|0«)Kx). 

The following theorem bounds the expected (squared) Hellinger distance in 
terms of the Kullback-Leibler divergence. 

Theorem 7. Let Ftv be a finite collection of estimators 0' for 0, and 
pen(-) a function on T^ satisfying the condition 

(15) Y. e-P''"^^') < 1. 

Let 9 be a penalized maximum likelihood estimator of the form 

(16) ^(X) =arg min {- logp(X|0') + 2pen(0')}- 

0'erjv 

Then 

(17) E[H\p~g,pe)] < min {K{pe,Pe') + 2ven{e')}. 



Proof. The proof uses the same key ideas as found in Li (1999) and 
Li and Barron (2000). Begin by noting that 

^'(P0(i),P0(2)) = / [VpR^- Vp(x|0(^^-'' 



){2)^ 



Z^ X 



(18) =2(^1- /"v'23(x|6'W)p(x|6'(2))i.(x) 

<-21og /y^p(x|0(^))p(x|0(2))j^(x) 
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where v is the dominating measure. Taking the expectation with respect 
to X, we then have 



i^[F2(p.,p0)]<2i?log 



<2£;iog 



1 



J[v/p(x|^Mx|0)]i.(x) 



^pi/2(x|0)e"P^>^W /[Vp(x|0)p(x|6>)] i/(x)/ ' 



where is the argument that minimizes the right-hand side of the expression 
in (17), which is simply the theoretical analogue of Q. However, the last 
expression above may be broken into two pieces, being equal to 

,P(X|0) 



(19) 



(20) 



log 



p(x|0: 

+ 2Slog 



+ 2pen(0) 
/ p^/^(X|0) 



-pen(0) 



\^pi/2(x|6l) /[Vp(x|0)p(x|6>)] i/(x)/ ' 

Noting that the expression in (19) is just the right-hand side of (17), the 
rest of the proof entails showing that the expression in (20) is bounded above 
by zero. Specifically, applying Jensen's inequality we have the upper bound 



(21) 



21ogS 



'pcn(0) 



v/(p(X|0)/p(X|0)) 



/[Vp(x|^)p(x|0)]i/(x)J' 
The integrand in the expectation in (21) can be upper bounded by 

p,,(g,) V(p(x|0O/p(x|0)) 



(22) 



e'erjv 



/[Vp(x|0')p(x|0)]z^(x) 



which, since Q' does not depend on X, produces the following upper bound 
for (20): 



(23) 



2 log y .-pen(e') -gK(p(X|gOMX|0))] 
^ /[Vp(x|0')p(x|0)]j^(x)' 



e'erj, 



Now the integration /[\/p(x|0')p(x|0)] i^(x) in the denominator can be rewrit- 
ten as an expectation with respect to the distribution of X through mul- 
tiplication by p(x|0)/p(x|0). However, this yields the same expectation as 
appears in the numerator of each term in (23), and hence (23) is equal 
to 21og^0/grjve"P''''(^'\ This and the condition in (15) yield that (23) is 
bounded by zero. D 

The result of Theorem 7 holds quite generally, and in particular for each of 
the densities of models (G), (P) and (M). Use of this bound for statements 
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regarding the estimators Ot, Ordp and O^ip requires that the inequahty 
in (15) holds, for which it is sufficient to estabhsh the following. 

Lemma 1. Let Tj\f be the collection of all N -length vectors 0' with com- 
ponents 9[ G D7v[i?i,i?2]. for some Ri < R2, where Dn[Ri,R2\ denotes a 
discretization of the interval [Ri,R2\ into N^'"^ equispaced values. Let i^{0') 
count the number of constant-valued sequences in the vector 6' , that is, in 
analogy to the number of pieces of a piecewise constant function. Then 

(24) Y. e-^i°s^#(^') < 1, 

e'erjv 

for 7 > 3/2 and TV > 3. 

Proof. Begin by writing T^ = UdLi Tjy , where T^j^' is the subset of 
values 0' that is composed of d constant-valued sequences. For example, 
(1,1,2,2,3) and (1,2,3,3,3) might be two such sequences in T\ . Each of 
the members of Fjy has the same value for the summand in (24), and there 
are {N^''^)'^ distinct values that may be taken on by the set of d constant- 
valued components of each member. Also, there are A^ — 1 choose d — 1 
possible d-component vectors of length N. So we have 

V- g-7log7V#(0') = y f^-'^\ ^-(7-l/2)<ilog7V 

e'eVN d.=i ^ 

N-l 



Y ( ^7 "^ ' ^-(7-l/2)K+l)log7V 



d'=0 

^-1 tAT_-i \d' 



< V ^^ ~ ^^ AT~h~l/2)(d'+l) 



d'^ 



d'=0 ^'' 

which is bounded by 1 under the conditions given. D 

In the cases of 0t and 0rp, the number of nontrivial ujj defining the 
penalty in (8) is in fact the same as the penalty in the statement of the 
lemma. The spaces Ft and Frp, with appropriate discretization (to be de- 
fined below), are equivalent to Fat- For the case of ^rdp, it is enough to 
note the inclusion Frdp C Frp . 
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5.2. The Gaussian case. 

Proof of Theorem 4 — model (G). Assume without loss of generality 
that (T = 1. Using inequality (17) and the fact that the Kullback-Leibler 
divergence in the Gaussian case is simply proportional to a (squared) £2- 
norm, we have that 

The minimization in (25) essentially seeks an optimal balancing of bias and 
variance terms, respectively. We will bound this quantity by bounding the 
bias term over Tj^ , for each fixed d, and then optimizing our resulting 
overall bound in d. In producing a bound on the bias, the following result 
from Donoho (1993) is central. 

Lemma 2. Let 9{-) e BV. Define 9{-) to he the best d-term approximant 
to 6{-) in the dyadic Haar basis for L2{[0, 1]). Then \\6 — 6\\l2 = 0{l/d). 

Define 6 to be the average sampling of 6 on the intervals /j, where the 
dependence of on d is to be understood. Then let 6 = [0] be the result of 
quantizing the elements of 6 to the set Di^[—C,C], where C is the radius 
of the BV ball in the statement of Theorem 4. By the triangle inequality it 
follows that 

{i/N)\\e-0'\\l<{i/N)\\e-e\\l + {i/N)fe-~e'\\l 

(26) - - ., 

+ i2/N)\\e-e\\,j\e-e\\,^. 

The first term on the right-hand side is a measure of approximation error 
in ^2(-^)) which may be bounded by the corresponding L2 approximation 
error by exploiting the piecewise constant nature of the Haar basis and 
our use of average sampling. Specifically, let hj^k{i) be the (j, /c)th Haar 
function on the discrete space {0,1,... ,A^ — 1}, as defined in Section 2, 
and let /i5fc(t) be its analogue on the interval [0,1]. Then it follows that 

(0, /ij,fc)^2 = N^^'^iO, K j)l2 and therefore 

{i/N)\\e-~e\\l = {i/N) Y. {{d^Hk)i2-{d,h,,k),,f 

{j,k)eJN 

where Jn is the set of {j, k) with j = 0, 1, . . . , J — 1 and /e = 0, 1, . . . , 2-' — 1. 
However, the last term in (27) is bounded by a similar sum over all (j, /c), 
which in turn is equal to the (squared) L2 approximation error ||^ — ^||^ . 
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Hence the first term in (26) is of order 0{d~^). Tlie second term is 
simply a discretization error, wliich is controlled through the conditions 
of Lemma 1 to be of order 0{1/N). Therefore the cross-term in (26) is 
of order 0{d~^N~^''^). In the case of estimation through the thresholding 
or recursive partitioning strategies, the quantity #(0) will be proportional 
to d. Combining the above results [and ignoring the negligible 0{1/N) term] 
yields the bound 

(28) e'er(,^)l2iV "^^ N ^^ ^ 

< 0{d~^) + 0{d'^N~^/^) + 0{dN~^ logN), 

which is minimized for d ~ {N/ log N)^'^. Substitution then yields the result 
that R{0, 6) is bounded by a quantity of order 0{{\og N / NY'^) . A similar ar- 
gument holds for estimation via recursive dyadic partitioning (RDP), where 
#(0) instead behaves like dlogA^, yielding the bound 0((loge(iV)/iV)^/^). 
D 

Proof of Corollary 1 — model (G). Proof of this corollary, for all 
three models, is inherent in the proof of Theorem 7. Specifically, following 
Li (1999), define the "affinity" between two densities p and q as A{p,q) = 
/(p(x)q'(x))"'^" ;/(x). Then (18) can be rewritten as 

(29) H'^{P0(i),P0(2)) < -2logA{pg(i),pQ(2)), 

and therefore Theorem 7 equivalently can be viewed as a bound on minus 
twice the log-affinity and related quantities thereof. For example, under inde- 
pendent sampling we have that A{pq{i),Pq{2)) = JJiA{p (i),p (2)), and a short 

i i 

calculation shows that for model (G) —2log A{p0^,Pg,) = (l/4)(0j — Oi)'^. D 

Proof of Theorem 5 — model (G). In the Gaussian case this fol- 
lows from standard arguments, as have been used for analogous statements 
for wavelet-based estimators with the Gaussian signal-plus-noise model. See 
Donoho (1993), for example. The approach is based on the so-called method 
of hyperrectangles of Donoho, Liu and MacGibbon (1990). That is, one de- 
fines an object 

(30) W,- = j ^ Pj,khlkit) ■■ m < A,|, 

I fc=0 J 

where the choice of Aj oc 2~^-^'^ is made to produce a hypercube "just 
barely" in our BV ball Q = BV(C), and j = j* is chosen to satisfy the 
constraint 2^-''^ ~ N^'"^ so as to produce the hardest possible estimation 
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problem on such hyperrectangles. Because the £2 risk for estimating objects 
in TCj is simply the sum of the £2 risks for estimating the individual Pj^k in 
this setting, and because this latter can be bounded below by 2-' e^, where 
e oc N~^''^ is the noise level, the lower bound of N~'^''^ on the minimax risk 
follows. D 

Proof of Theorem 6 — model (G). The analogue of Theorem 4 in 
this context is proven simply by modifying the statement of Lemma 2 ac- 
cordingly. That is, for Besov spaces in the range specified, the best d-term 
approximant in the dyadic Haar basis for L2([0, 1]) has an error ||0 — ^H^j of 
order 0{d~^). See DeVore (1998), for example. The effect of this change is 
to change the lead term in (28) from 0{d~'^) to 0(d~^^), and the cross-term 
similarly, from which the upper bounds follow. Corollary 1 then follows as 
before. Finally, proof of Theorem 5 follows as in the case of BV, but with 
the appropriate changes made to the definition of A and j* . D 

5.3. The Poisson and multinomial cases. 

Proof of Theorem 4 — models (P) and (M). As remarked previ- 
ously, the bound in Theorem 7 holds for these two models as well. How- 
ever, the Kullback-Leibler divergence in these cases is not equivalent to an 
^2-norm. Nevertheless, we may pursue a strategy similar to that in the Gaus- 
sian case by bounding the size of the Kullback-Leibler term when evaluated 
at a particular estimate relating to an optimal nonlinear Haar approximant 
associated with the function 9{-). 

Begin with model (P). Let 6{-) be the best d-term nonlinear approximant 
to 6{-), in the sense of Lemma 2, and define 9i = N Jj, 9{t) dt through average- 
sampling. Now the condition that the function 6[-) S [c,C] and the use of 
average-sampling ensure that for the elements of we have Oi G [c, C] as well. 
However, we also have that 9i G [c,C]. This results from the fact that the 9i 
derive from average-sampling the function 9{-), and this latter has a range 
restricted to [c, C]. This last statement follows from noting that, by definition 
of minimizing ||^ — /||l2(o,i) ^^ ^ Haar basis, the function 9 equivalently is 
defined by a set of characteristic functions on some dyadic subintervals / 
that partition [0,1] and associated constants aj = \I\~^ Jj6{t)dt. The a/ 
clearly are bounded by c and C. 

Continuing, let 6 = [6] be the result of quantizing the elements of to 
the set Di\f[c,C]. Then the Kullback-Leibler divergence may be bounded as 

1 1 ^~^ /9 

^i^(p.,P^') = ^E^l-^.+^aog^ 
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N 



=0 

(31) 



^ (^lye'', + ef - 20',e,) 



< ^ llfl fl'l|2 

where the first inequahty follows from log(2:) < z — 1 and the second follows 
from the fact that 6'^ S [c, C] \/ i. The inequality in (31) leaves us in essentially 
the same position with which we started in the Gaussian case. Hence, arguing 
in an entirely analogous fashion we are led to the same inequality as in (28), 
and the result of Theorem 4 is proven for model (P). 

The argument for model (M) is similar in structure to that for model (P), 
but differs with respect to certain important technical details, deriving from 
the fact that 6 must be both positive and sum to 1. Specifically, we begin 
by writing e{t) = 1 + (0(t) - 1) = l+g{t). Then, let g{-) be the best d-piece 
nonlinear Haar approximant to g{-). Since 0G0^(/G0, it follows that 
\\g-~g\\L,=0{d-^). 

Next note that the Haar scale coefficient of g{-) at j = is zero, from 
which it follows that that scale coefficient will be zero for g[-) as well, and 
therefore the latter is defined purely in terms of d nonzero wavelet coeffi- 
cients. Since the wavelets have zero integral, defining 6{t) = 1 + g{t) results 
in an approximant to 9 that integrates to 1. Hence 6[-) will be a proper 
density if it is nonnegative. However, an argument similar to that of the 
Poisson case can be used to show that 5€(c— 1,C — 1) implies the same 
for g{-), from which it follows that 9 G {c,C). 

Define through integration (i.e., noi average-sampling) via 9i = /j. 9{t) dt 

and note that 6 will be a proper probability mass function on the set 
{0,1,...,A^ — 1}, with elements 9i bounded below by c/N. It remains for 
us to quantize in such a manner as to preserve this property, which we 
accomplish by working instead with NO. Noting that N9i G [c.,C] Vi, we 
quantize each of these elements away from zero in the positive (i.e., toward 
+oo) direction on Djv[c, C]. Similar to the Gaussian and Poisson cases, this 
means that | [N9i\ - N9i \ ~ iV'^/^^ ^^y definition of Disi[c, C]. 

Division of [NO] by A'' would produce an object on the same scale as 0, 
with components in [c, C] , but it would no longer be a proper probability 
mass function because our method of quantization leads to an inflation of the 
overall mass by the amount fi = —N + J2iJo [-^^i]- We can correct for this 
increase by subtracting 6 = fi/N ~ 0{N~^''^) from each element of [NO]. 
For N sufficiently large, say c/2 > A^~^/^, our final estimator = {[NO] — 
8)/N is a proper probability mass function and is bounded below by c/{2N). 
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Now, similar to (31), bound the Kullback-Leibler divergence between 
pe and p^' as 

1 1 ^^^ /9- 

i=0 ^'^i 

(32) =^^ei-0,+eaog 

< 0-9 , , 

where the factor of A^ in the numerator of the last line comes from 9'->-^. 
Under the condition n ~ A^ in model (M) , we are left with something that 

behaves like A^||0 — ||| in (32). Similar to the previous cases, this may 
be bounded by terms involving approximation error, discretization (quan- 
tization) error and a cross term, using the triangle inequality. Noting that 
the use of simple integration (as opposed to average-sampling) here leads 
to the alternative relation {6,hj^k)i2 — ^~^ i^i^'j k)L2 between the discrete 
and continuous Haar coefficients, we find that the quantity N\\6 — 6\\'j may 
be bounded above by ||^ — OWi^ = \\g — g\W^ = 0{d~'^). Similarly, by con- 
struction we have ||0 — ||| = 0{N^'^), and so the discretization error is 
again 0{N~^). Therefore, an inequality like that in (28) holds, and the re- 
sult of Theorem 4 is proven for model (M) . D 

Note that, as a consequence of our argument, the condition n^ N arises 
in a natural manner. The interpretation of this condition is, viewed from 
the context of density estimation, that the number of total possible bins N 
should be chosen on the order of the number of samples n. The underlying 
algorithms will choose an optimal number less than or equal to n (in fact, 
likely much less) , due to the fact that the case N > n assures that there will 
be empty bins and these will be aggregated over. 

Proof of Corollary 1 — models (P) and (M). Model (P) involves 
independent sampling, and thus it suffices to note that a short calcula- 
tion produces the expression —2\ogA{pe^,p§,) = {0./ —6.1 )^. Now consider 
model (M), where we have 

■A{pe,Pe) = Y^ 

The right-hand side of the above equation is itself an affinity to the power n, 
since both and sum to 1. Therefore minus twice the logarithm of this 
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quantity is an upper bound on n times the (squared) Hellinger distance 
between these two vectors. D 



Proof of Theorem 5 — models (P) and (M). Our argument here 
is in the spirit of the method of orthogonal hyperrectangles outhned in the 
Gaussian case, but with a number of technical differences. Consider the 
Poisson case first and begin by noting that it is not difficult to show that 
the constraint 6{-),9'{-) £ [c,C] implies that the Kullback-Leibler divergence 
and (squared) Hellinger distance between densities corresponding to and 6' 
are within a constant factor of each other, where the constant is a function 
of c and C. Hence it suffices to provide a lower bound on the quantity 



(33) 



ini sup— E[K{pe, 



Because the Kullback-Leibler divergence is simply an expected log-likelihood 
ratio, and the Poisson model has a multiscale likelihood factorization involv- 
ing binomial conditional probabilities, we find (adopting a dyadic analysis 
and ignoring the coarsest scale term) that 



K{pe,P0' 



(34) 






log 



piX. 



j+l,2k 



Xj,k^^i,k) 



P{Xj+l^2k\Xj^k,^'jk) 



E' 

j,k 



'j,k 



ijJ 



'i,fc 



.•log 



u 



'j,k 



UJ' 



j,k 



+ {l -u;j,fc)log 



■ UJ 



j,k 



■ ijJ 



j,k 



Next, define a hypercube (actually we use just the boundary or shell) in 
u?-space in analogy to that in (30) by specifying (i) 6*0,0 is fixed and known, 
(ii) ujj^k = 1/2 yJT^j*, and (in) Uj^k = 1/2 + s^A for j = j* , where Sk = ±1 
is an unknown sign and A = A(j*) G (0, 1/2) is a perturbation of known 
magnitude. Then, for this particular subproblem, estimation of 9 reduces 
to estimation of the Sk- Since, for appropriately defined values of j* and A, 
our hypercube will be contained within the set of 6 induced by our function 
space = BV(C), it follows using (34) that 



2J 



(35) 



N 



Y^ 6'j*,fcinfsupr(sfc,Sfc 



k=0 



Sk Sk 



lower-bounds the quantity in (33), where 



(36) r{s,s) = E 



+ sA lo, 



1/2 + sA 
1/2 + sA 



+ 



sA ) log 



1/2 -sA 
1/2 - sA 



However, the optimization problem inf sup r(s,s) is equivalent to a stan- 
dard decision problem with binomial observations, a two-point action space, 
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and 0/1 loss. Therefore, restricting attention to estimators of the form 
s = ±1 we find that the expression for r{s, s) can be simphfied to 

•1/2 + A^ 



(37) 



2Alog 



Pr(s/s). 



,1/2 -A, 

Neyman-Pearson theory then yields that each individual probability Pr(sfc 7^ 
Sk) is minimized by the estimator Sk = sgn{Xj*^i^2k — -'^j*+i,2fc+i)) and 
therefore approximately equal to 

(38) p* = Pr {Z < -(0j.+i,2fe - 0j,+i,2k+i)/Ve]^} 

for sufficiently large 9j*+i,2k and 6'j*+i^2fc+ii where Z is a standard normal 
random variable. Note that by construction of our hypercube the value p* 
is the same for all k. 

Finally, we address the issue of choice of A and j* . First note that func- 
tions within our hypercube are simply piecewise constant functions of a fixed 
magnitude. Recalling the definition of the 9j^k by average-sampling and the 
relation 0j+i,2fc = ^j.fc'-^j,A;) simple calculations yield that the height of any 
drop or rise in this function is simply proportional to 2A [where the con- 
stant of proportionality is due to /q 9{t)dt, which may be arbitrarily set 
to 1]. With respect to the total variation seminorm defined in (13), this 
implies that for such functions to be in the space BV(C) we must have 
A < (C/4)2~-' . Next, analogous to the Gaussian case, we choose j* to sat- 
isfy the constraint 2'^-'" = {C /2)N^''^ . This choice can be motivated, for 
example, by selecting that value of j for which the signal-to-noise ratio in 
the empirical Haar coefficients (X.,hj^ig) is 1. 

Combining the expressions in (35), (37) and (38), and exploiting the fact 
that the 9j*^k are equal for all k, we find that our lower bound on the minimax 
risk in (33) behaves like 



(39) ^)>,:o 



2Alog'l/^ + ^ 



P 



1/2 -A, 

To complete our proof, we note that the argument of the probability in (38) is 
in fact the Haar coefficient signal-to-noise ratio (i.e., the ratio of the expected 
value to the standard deviation), and hence p* ~ Pr(Z < —1) may be treated 
as a constant in (39). Additionally, since Ojj. = 9qo/2^ and ^0,0 — ^ lo ^{i) dt, 

it follows that -jj-9j*fl reduces to a constant. Lastly, a Taylor series expansion 
shows that the term within brackets behaves like 8A^. From our definition 
of A and choice of j* it follows that A ~ N~'^''^ , from which the minimax 
lower bound rate of N~'^'^ stated in the theorem follows. 

For the multinomial case, the proof proceeds in an almost identical man- 
ner. As the multinomial likelihood too has a multiscale likelihood factor- 
ization with binomial conditional probabilities, the expression in (34) holds 
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in this case as well, although with 6j^k replaced by nOj^k- The hypercube is 
defined as before, but with 0o,o = 1 by virtue of 9{-) being a density, and the 
same underlying decision problem and optimal solution result. The discus- 
sion leading to our choice of A remains unchanged, and the constraint of 
2^ii'^ ~ N"^/"^ (with appropriately defined constants) again renders p* a con- 
stant. Finally, the quantity to the left of the bracketed expression in (39) now 
is (2-' /N) n9j*fi, which is approximately 1 by the defining condition ?i ~ A^ 
in our specification of model (M). The rest of the argument is identical to 
that of the Poisson case. D 

Proof of Theorem 6 — models (P) and (M). As the proof of Theo- 
rem 4 for models (P) and (M) exploited the L2 approximation error proper- 
ties of optimal nonlinear Haar approximants in a manner analogous to that 
of model (G), proof of Theorem 6 and the other results for these models fol- 
low using precisely the same modifications described earlier for model (G). 
D 

6. Discussion. In this paper we lay a succinct conceptual foundation for 
the existence of certain multiscale likelihood factorizations. We also establish 
adaptivity and near-optimality of certain multiscale complexity penalized 
likelihood estimators, based on these factorizations, through study of their 
risk behavior. A key feature of our formulation and analysis is that canon- 
ical models of continuous, count and categorical data — Gaussian, Poisson 
and multinomial — are handled with common estimators, algorithms and risk 
analysis. The properties of our estimators derive essentially from their ability 
to exploit the fact that the decoupling inherent in the underlying multiscale 
factorizations for these models mirrors the decomposition deriving from an 
associated Haar basis. 

In a sense this paper can be viewed also as providing some degree of expla- 
nation of and justification for the performance of other earlier work by the 
authors and colleagues with multiscale factorizations in specific methodolog- 
ical contexts, such as the analysis of Poisson time series [Kolaczyk (1999a, b)] 
and images [Timmermann and Nowak (1999)], Poisson linear inverse prob- 
lems [Nowak and Kolaczyk (2000)] and the spatial analysis of continuous 
and count data in geography [Kolaczyk and Huang (2001)]. The multiscale 
likelihood approach analyzed here, based on average-sampling of the contin- 
uous object, fits quite naturally in many imaging applications in which the 
instrumentation involves spatially binning photon detections and a Poisson 
model. Similar comments apply in histogram and density estimation con- 
texts involving binned data and the multinomial model. Moreover, it is in 
contexts like that of this last paper that some particularly interesting as- 
pects of the flexibility of the multiscale likelihood framework come to light. 
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For example, in the geographical analysis of census data the notion of "mul- 
tiscale" might arise through a desire to consider the effects of various levels 
of geopolitical aggregation (e.g., towns, counties, states) on, say, changes in 
population dynamics between two decennial censuses. A spatial-temporal 
analysis of this type may be set up and executed using a framework com- 
pletely analogous to that underlying the recursive dyadic partitioning esti- 
mator considered in this paper. See Kolaczyk and Huang (2001) for details. 
Lastly, we include two comments regarding details in our method of proof 
for risk analysis. First, we note the role that our adaptation of results of Li 
(1999) and Li and Barron (2000) plays in our derivation of upper bounds 
for the risk. The bound presented in Theorem 7 is quite general, and our 
subsequent usage of that bound suggests its usefulness in other problems 
of complexity penalized likelihood estimation in the nonpar ametric context. 
Second, we point out that, while our proof of the upper bounds on the 
risk (i.e.. Theorem 4) does not explicitly use the decoupled structure of our 
likelihood factorizations, this structure does play a key role in our adaptation 
of the method of hyperrectangles for providing lower bounds on the risk (i.e.. 
Theorem 5) in the Poisson and multinomial cases, wherein Kullback-Leibler 
divergences are simply £2 risks in the standard Gaussian sequence model. 

APPENDIX 



Proof of Theorem 3. By virtue of the hkelihood factorization (4), 
and the additivity of penalty function (8) in the multiscale indexing I G 
Xnt(7'*), for all three estimators (9), (10) and (11) the objective function to 
be optimized in (7) is simply of the form X]/eXNT(7'*) ^^' ^^^^^ ^^^ ^i ^^^ 
functions of the data X and parameters 0. In particular, for each estimator 

there is at each / a choice being made between Hj : {ujj is trivial} , versus 

Hj : {loj is nontrivial}. The Wj are either the negative log- likelihood under 

the null Hj or that under the alternative Hj plus a penalty in the amount 
of 2A. 

In the case of thresholding, determining the value for Wj for each / corre- 
sponds to performing N independent generalized likelihood ratio tests, and 
hence is of 0{N) algorithmic complexity trivially. 

The case of recursive dyadic partition estimators follows arguments paral- 
lel to those in Donoho (1997). Specifically, any RDP of [0, 1) can be matched 
in one-to-one correspondence with a dyadic Haar basis in which there is 
a "hereditary" constraint on which coefficients are "kept" and which are 
"killed." That is, if a coefficient is to be kept, all of its ancestors must 
be kept as well; conversely, if a coefficient is killed, all of its descendents 
are killed as well. Hence, searching for an optimal RDP, say V :< V^ , is 



28 E. D. KOLACZYK AND R. D. NOWAK 

equivalent to a type of constrained thresholding. The constraint may be 
enforced by recursively moving from fine scales to coarse, and at each in- 
terval Ij^k choosing the optimal subpartition V{Ij,k) on Ij^^ to be either 
(i) the union of the optimal subpartitions V{Ij+i^2k) and T'{Ij+i^2k+i) on 
the children of Ij^k or (ii) the trivial subpartition in which Ij^k is parti- 
tioned no further. These decisions are based on the same type of generalized 
likelihood ratio tests underlying the thresholding estimators, indexed in the 
Ij,k G 2rNx('PDy)- Therefore, this is the same as the optimal pruning algorithm 
for CART, which requires on the order of N operations [Donoho (1997) and 
Breiman, Friedman, Olshen and Stone (1984), algorithm 10.1, page 294]. 

Finally, we consider the case of the general recursive partitioning estima- 
tors. First note there are a total of — ^ — - unique subintervals I C [0, 1) that 
may be composed of the N finest resolution intervals Ij. For any length Nj > 
1 , there are exactly A^ — Nj + 1 of these subintervals of length Nj . Also note 
that any interval of cardinality Nj = m may be partitioned into two children 
intervals in exactly m — 1 ways. Therefore, in total, among the (A — 1)! possi- 
ble C-RPs, there are only E^=i m{N -m) = ^^^ - ^(^+mN+i) ^ ^ 
unique parent-child pairs. By exploiting both this redundancy and the inher- 
itance property underlying the RDP case, an efficient dynamic programming 
algorithm can be obtained. Here, however, the end result of the algorithm is 
not only an optimal partition V, but an accompanying C-RP V*{V) as well. 
Beginning with intervals / of cardinality A/ = 2, and working recursively 

over A/ = 3,4, ... , one can compute Wi under both hypotheses H\ and 

Hj , and pass the optimal decision (i.e., partition or not) for each interval 
"upward" (i.e., toward the coarsest interval [0,1)). That is, for a given in- 
terval I of length A/ = m [which may or may not appear in the definition of 
the final C-RP P*('P)], we determine and record the optimal subpartition 
V{I) on / and the associated optimal sub-C-RP 'P*{V{I)), and we record 
the complexity value 

(40) Y. ^i'- 

i'eit,T{r*{v{i)) 

This optimal (sub)partition and its complexity value can be found via a 
maximization over 0{m) terms involving the corresponding optimal subpar- 
titions and complexity values determined previously for each of the m — 1 
pairs of possible children of /. The maximization over all possible blocks in 
all possible C-RPs therefore requires roughly A^/6 comparisons. Once we 
reach the top, we are left with V and V*{V). D 

Acknowledgments. The authors thank Andrew Barron and David Donoho 
for helpful discussions. They also thank the referees, Associate Editors and 
Editors involved at various stages for their comments and suggestions lead- 
ing up to the final version of this paper. 



MULTISCALE LIKELIHOOD ANALYSIS 29 

REFERENCES 

Bar-Lev, S. K. and Enis, P. (1986). Reproducibility and natural exponential families 

with power variance functions. Ann. Statist. 14 1507-1522. MR868315 
Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical The- 
ory. Wiley, New York. MR489333 
Barron, A., Birge, L. and Massart, P. (1999). Risk bounds for model selection via 

penalization. Probab. Theory Related Fields 113 301-413. MR1679028 
Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE 

Trans. Inform. Theory 37 1034-1054. 
Breiman, L., Friedman, J., Olshen, R. and Stone, C. J. (1984). Classification and 

Regression Trees. Wadsworth, Belmont, CA. MR1111806 
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia. MR1162107 
DeVore, R. a. (1998). Nonlinear approximation. In Acta Numerica 7 51-150. Cambridge 

Univ. Press. MR1689432 
DONOHO, D. L. (1993). Unconditional bases are optimal bases for data compression and 

for statistical estimation. Appl. Comput. Harmon. Anal. 1 100-115. MR1256530 
DONOHO, D. L. (1997). CART and best-ortho-basis: A connection. Ann. Statist. 25 1870- 

1911. MR1474073 
DoNOHO, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995). Wavelet 

shrinkage: Asymptopia? (with discussion). J. Roy. Statist. Soc. Ser. B 57 301-369. 

MR1323344 
DONOHO, D. L., Liu, R. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, 

and implications. Ann. Statist. 18 1416-1437. MR1062717 
GiRARDi, M. and Sweldens, W. (1997). A new class of unbalanced Haar wavelets that 

form an unconditional basis for Lp on general measure spaces. J. Fourier Anal. Appl. 

3 457-474. MR1468375 
JOSHI, S. W. and Patil, G. P. (1970). A class of statistical models for multiple counts. 

In Random Counts in Scientific Work (G. P. Patil, ed.) 2 189-203. Pennsylvania State 

Univ. Press. MR287599 
KOLACZYK, E. D. (1999a). Bayesian multiscale models for Poisson processes. J. Amer. 

Statist. Assoc. 94 920-933. MR1723303 
KOLACZYK, E. D. (1999b). Some observations on the tractability of certain multi-scale 

models. In Bayesian Inference in Wavelet-Based Models. Lecture Notes in Statist. 141 

51-66. Springer, New York. MR1699876 
KOLACZYK, E. D. and Huang, H. (2001). Multiscale statistical models for hierarchical 

spatial aggregation. Geographical Analysis 33 95-118. 
Lauritzen, S. L. (1996). Graphical Models. Oxford Univ. Press. MR1419991 
Li, Q. J. (1999). Estimation of mixture models. Ph.D. dissertation, Dept. Statistics, Yale 

Univ. 
Li, Q. J. and Barron, A. R. (2000). Mixture density estimation. In Advances in Neural 

Information Processing Systems 12 279-285. MIT Press, Cambridge, MA. 
NowAK, R. D. (1999). Multiscale hidden Markov models for Bayesian image analysis. In 

Bayesian Inference in Wavelet-Based Models. Lecture Notes in Statist. 141 243-265. 

Springer, New York. MR1699845 
NowAK, R. D. and Kolaczyk, E. D. (2000). A statistical multiscale framework for Poisson 

inverse problems. IEEE Trans. Inform. Theory 46 1811-1825. MR1790322 
Sweldens, W. (1998). The lifting scheme: A construction of second generation wavelets. 

SIAM J. Math. Anal. 29 511-546. MR1616507 



30 



E. D. KOLACZYK AND R. D. NOWAK 



TiMMERMANN, K. E. and NowAK, R. D. (1999). Multiscale modeling and estimation of 
Poisson processes with application to photon-limited imaging. IEEE Trans. Inform. 
Theory 45 846-862. MR1682515 

WiLKS, S. S. (1962). Mathematical Statistics. Wiley, New York. MR144404 



Department of Mathematics 

AND Statistics 
Boston University 
Boston, Massachusetts 02215 
USA 
E-MAIL: kolaczykOmath.bu.cdu 



Department of Electrical 

AND Computer Engineering 
University of Wisconsin 
Madison, Wisconsin 53706 
USA 
E-MAIL: nowak@encr.wisc.odu 



